Valores missing, outliers y correlación¶
En este notebook continuamos con el preprocesamiento y análisis exploratorio del dataset procesado previamente. Nos enfocamos en profundizar en las relaciones entre variables y en la preparación de datos para modelos predictivos. Nos enfocaremos en el análisis de distribuciones, correlaciones y segmentaciones, además de explorar posibles valores atípicos y relaciones clave entre las variables y el objetivo (TARGET).
Vamos a seguir los siguientes pasos:
- Cambio de tipos de variables
- Separación en train y test
- Visualización descriptiva de los datos
- Tratamiento de las variables continuas: correlaciones de pearson, estudio de outliers y estudio de valores missing
- Tratamiento de las variables categoricas: estudio de correlaciones con vCramer, relleno de valores missing
Librerias¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import scipy.stats as stats
from scipy.stats import chi2_contingency
import warnings
import sys
Funciones¶
sys.path.append('../src/')
import functions as f
sys.path.remove('../src/')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 10000)
pd.set_option('display.width', 10000)
#Constantes
seed= 12345
Carga de los datos¶
# Carga de los datos
pd_data = pd.read_csv("../data/df_data_1.csv").set_index('SK_ID_CURR')
pd_data.head()
| COMMONAREA_AVG | NONLIVINGAPARTMENTS_AVG | FONDKAPREMONT_MODE | LIVINGAPARTMENTS_AVG | FLOORSMIN_AVG | YEARS_BUILD_AVG | OWN_CAR_AGE | LANDAREA_AVG | BASEMENTAREA_AVG | EXT_SOURCE_1 | NONLIVINGAREA_AVG | ELEVATORS_AVG | WALLSMATERIAL_MODE | APARTMENTS_AVG | ENTRANCES_AVG | LIVINGAREA_AVG | HOUSETYPE_MODE | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_AVG | TOTALAREA_MODE | EMERGENCYSTATE_MODE | OCCUPATION_TYPE | EXT_SOURCE_3 | ORGANIZATION_TYPE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | NAME_TYPE_SUITE | DEF_60_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | EXT_SOURCE_2 | AMT_GOODS_PRICE | AMT_ANNUITY | CODE_GENDER | CNT_FAM_MEMBERS | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | REGION_RATING_CLIENT_W_CITY | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_INCOME_TOTAL | CNT_CHILDREN | FLAG_OWN_REALTY | FLAG_OWN_CAR | FLAG_DOCUMENT_12 | AMT_CREDIT | WEEKDAY_APPR_PROCESS_START | NAME_INCOME_TYPE | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | REGION_RATING_CLIENT | FLAG_EMAIL | FLAG_PHONE | FLAG_CONT_MOBILE | FLAG_WORK_PHONE | NAME_CONTRACT_TYPE | FLAG_EMP_PHONE | FLAG_MOBIL | DAYS_ID_PUBLISH | DAYS_REGISTRATION | DAYS_EMPLOYED | DAYS_BIRTH | REGION_POPULATION_RELATIVE | NAME_HOUSING_TYPE | NAME_FAMILY_STATUS | NAME_EDUCATION_TYPE | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 0.0143 | 0.0000 | reg oper account | 0.0202 | 0.1250 | 0.6192 | NaN | 0.0369 | 0.0369 | 0.083037 | 0.0000 | 0.00 | Stone, brick | 0.0247 | 0.0690 | 0.0190 | block of flats | 0.0833 | 0.9722 | 0.0149 | No | Laborers | 0.139376 | Business Entity Type 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Unaccompanied | 2.0 | 2.0 | 2.0 | 2.0 | 0.262949 | 351000.0 | 24700.5 | M | 1.0 | -1134.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 202500.0 | 0 | Y | N | 0 | 406597.5 | 3 | Working | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 0 | Cash loans | 1 | 1 | -2120 | -3648.0 | -637 | -9461 | 0.018801 | House / apartment | Single / not married | Secondary / secondary special | 1 |
| 100003 | 0.0605 | 0.0039 | reg oper account | 0.0773 | 0.3333 | 0.7960 | NaN | 0.0130 | 0.0529 | 0.311267 | 0.0098 | 0.08 | Block | 0.0959 | 0.0345 | 0.0549 | block of flats | 0.2917 | 0.9851 | 0.0714 | No | Core staff | NaN | School | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Family | 0.0 | 1.0 | 0.0 | 1.0 | 0.622246 | 1129500.0 | 35698.5 | F | 2.0 | -828.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 270000.0 | 0 | N | N | 0 | 1293502.5 | 1 | State servant | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | Cash loans | 1 | 1 | -291 | -1186.0 | -1188 | -16765 | 0.003541 | House / apartment | Married | Higher education | 0 |
| 100004 | NaN | NaN | NaN | NaN | NaN | NaN | 26.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | 0.729567 | Government | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 0.555912 | 135000.0 | 6750.0 | M | 1.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 67500.0 | 0 | Y | Y | 0 | 135000.0 | 1 | Working | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 1 | Revolving loans | 1 | 1 | -2531 | -4260.0 | -225 | -19046 | 0.010032 | House / apartment | Single / not married | Secondary / secondary special | 0 |
| 100006 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | NaN | Business Entity Type 3 | NaN | NaN | NaN | NaN | NaN | NaN | Unaccompanied | 0.0 | 2.0 | 0.0 | 2.0 | 0.650442 | 297000.0 | 29686.5 | F | 2.0 | -617.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 135000.0 | 0 | Y | N | 0 | 312682.5 | 3 | Working | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | Cash loans | 1 | 1 | -2437 | -9833.0 | -3039 | -19005 | 0.008019 | House / apartment | Civil marriage | Secondary / secondary special | 0 |
| 100007 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Core staff | NaN | Religion | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 0.322738 | 513000.0 | 21865.5 | M | 1.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 121500.0 | 0 | Y | N | 0 | 513000.0 | 4 | Working | 11 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 | Cash loans | 1 | 1 | -3458 | -4311.0 | -3038 | -19932 | 0.028663 | House / apartment | Single / not married | Secondary / secondary special | 0 |
pd_data.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'> Index: 307511 entries, 100002 to 456255 Data columns (total 93 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 COMMONAREA_AVG 92646 non-null float64 1 NONLIVINGAPARTMENTS_AVG 93997 non-null float64 2 FONDKAPREMONT_MODE 97216 non-null object 3 LIVINGAPARTMENTS_AVG 97312 non-null float64 4 FLOORSMIN_AVG 98869 non-null float64 5 YEARS_BUILD_AVG 103023 non-null float64 6 OWN_CAR_AGE 104582 non-null float64 7 LANDAREA_AVG 124921 non-null float64 8 BASEMENTAREA_AVG 127568 non-null float64 9 EXT_SOURCE_1 134133 non-null float64 10 NONLIVINGAREA_AVG 137829 non-null float64 11 ELEVATORS_AVG 143620 non-null float64 12 WALLSMATERIAL_MODE 151170 non-null object 13 APARTMENTS_AVG 151450 non-null float64 14 ENTRANCES_AVG 152683 non-null float64 15 LIVINGAREA_AVG 153161 non-null float64 16 HOUSETYPE_MODE 153214 non-null object 17 FLOORSMAX_AVG 154491 non-null float64 18 YEARS_BEGINEXPLUATATION_AVG 157504 non-null float64 19 TOTALAREA_MODE 159080 non-null float64 20 EMERGENCYSTATE_MODE 161756 non-null object 21 OCCUPATION_TYPE 211120 non-null object 22 EXT_SOURCE_3 246546 non-null float64 23 ORGANIZATION_TYPE 252137 non-null object 24 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64 25 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64 26 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64 27 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64 28 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64 29 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64 30 NAME_TYPE_SUITE 306219 non-null object 31 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 32 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 33 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 34 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 35 EXT_SOURCE_2 306851 non-null float64 36 AMT_GOODS_PRICE 307233 non-null float64 37 AMT_ANNUITY 307499 non-null float64 38 CODE_GENDER 307507 non-null object 39 CNT_FAM_MEMBERS 307509 non-null float64 40 DAYS_LAST_PHONE_CHANGE 307510 non-null float64 41 FLAG_DOCUMENT_10 307511 non-null int64 42 FLAG_DOCUMENT_2 307511 non-null int64 43 FLAG_DOCUMENT_3 307511 non-null int64 44 FLAG_DOCUMENT_4 307511 non-null int64 45 FLAG_DOCUMENT_5 307511 non-null int64 46 FLAG_DOCUMENT_6 307511 non-null int64 47 FLAG_DOCUMENT_7 307511 non-null int64 48 FLAG_DOCUMENT_8 307511 non-null int64 49 FLAG_DOCUMENT_9 307511 non-null int64 50 REGION_RATING_CLIENT_W_CITY 307511 non-null int64 51 FLAG_DOCUMENT_11 307511 non-null int64 52 FLAG_DOCUMENT_13 307511 non-null int64 53 FLAG_DOCUMENT_14 307511 non-null int64 54 FLAG_DOCUMENT_15 307511 non-null int64 55 FLAG_DOCUMENT_16 307511 non-null int64 56 FLAG_DOCUMENT_17 307511 non-null int64 57 FLAG_DOCUMENT_18 307511 non-null int64 58 FLAG_DOCUMENT_19 307511 non-null int64 59 FLAG_DOCUMENT_20 307511 non-null int64 60 FLAG_DOCUMENT_21 307511 non-null int64 61 AMT_INCOME_TOTAL 307511 non-null float64 62 CNT_CHILDREN 307511 non-null int64 63 FLAG_OWN_REALTY 307511 non-null object 64 FLAG_OWN_CAR 307511 non-null object 65 FLAG_DOCUMENT_12 307511 non-null int64 66 AMT_CREDIT 307511 non-null float64 67 WEEKDAY_APPR_PROCESS_START 307511 non-null int64 68 NAME_INCOME_TYPE 307511 non-null object 69 HOUR_APPR_PROCESS_START 307511 non-null int64 70 REG_REGION_NOT_LIVE_REGION 307511 non-null int64 71 REG_REGION_NOT_WORK_REGION 307511 non-null int64 72 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64 73 REG_CITY_NOT_LIVE_CITY 307511 non-null int64 74 REG_CITY_NOT_WORK_CITY 307511 non-null int64 75 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64 76 REGION_RATING_CLIENT 307511 non-null int64 77 FLAG_EMAIL 307511 non-null int64 78 FLAG_PHONE 307511 non-null int64 79 FLAG_CONT_MOBILE 307511 non-null int64 80 FLAG_WORK_PHONE 307511 non-null int64 81 NAME_CONTRACT_TYPE 307511 non-null object 82 FLAG_EMP_PHONE 307511 non-null int64 83 FLAG_MOBIL 307511 non-null int64 84 DAYS_ID_PUBLISH 307511 non-null int64 85 DAYS_REGISTRATION 307511 non-null float64 86 DAYS_EMPLOYED 307511 non-null int64 87 DAYS_BIRTH 307511 non-null int64 88 REGION_POPULATION_RELATIVE 307511 non-null float64 89 NAME_HOUSING_TYPE 307511 non-null object 90 NAME_FAMILY_STATUS 307511 non-null object 91 NAME_EDUCATION_TYPE 307511 non-null object 92 TARGET 307511 non-null int64 dtypes: float64(37), int64(41), object(15) memory usage: 220.5+ MB
Tras procesar los datos en el notebook anterior, contamos con un dataset depurado que ha sido limpiado de columnas redundantes. El dataset cuenta ahora con 93 columnas y 307,511 filas, como se puede observar al realizar un análisis inicial. Este análisis preliminar confirma que el formato del dataset es adecuado para continuar con la exploración y transformación.
Cambio de tipo a variables categóricas¶
Muchas de las columnas tienen naturaleza categórica, pero están almacenadas como objetos (object). Para optimizar su manejo, transformamos estas columnas a categoricas, lo que mejora el rendimiento del análisis, reduce el uso de memoria y facilita operaciones estadísticas.
pd_data = pd_data.astype({col: 'category' for col in pd_data.select_dtypes(include='object').columns})
Por ejemplo, la columna NAME_CONTRACT_TYPE contiene categorías como Cash loans y Revolving loans, que serán útiles para analizar patrones de comportamiento financiero. Mientras tanto, FONDKAPREMONT_MODE incluye datos categóricos relevantes como reg oper account, que también puede ser interpretado en análisis posteriores.
Para comenzar con el análisis de las variables, clasificamos las columnas del dataset según su tipo: booleanas, categóricas y numéricas. Este paso es fundamental para decidir los métodos de análisis y preprocesamiento que aplicaremos a cada tipo de variable.
pd_data_bool, pd_data_cat, pd_data_num = f.tipos_vars(pd_data, False)
En el conjunto de datos, algunas variables clasificadas inicialmente como categóricas contienen valores que en realidad corresponden a datos numéricos. Estas variables necesitan ser trasladadas al grupo de variables numéricas para garantizar su tratamiento adecuado en etapas posteriores.
variables_to_move = [
'CNT_FAM_MEMBERS', 'CNT_CHILDREN', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'HOUR_APPR_PROCESS_START'
]
Eliminamos las variables identificadas como numéricas del grupo categórico y las añadimos al grupo de variables numéricas. Este proceso asegura que las columnas se utilicen de manera coherente con su naturaleza.
pd_data_cat = [var for var in pd_data_cat if var not in variables_to_move]
pd_data_num.extend([var for var in variables_to_move if var in pd_data.columns])
Tras este ajuste, confirmamos que las variables han sido correctamente reclasificadas. Esto permite evitar errores durante el análisis exploratorio y el modelado posterior.
Para garantizar cálculos precisos y evitar errores de tipo, convertimos todas las variables numéricas al tipo de dato float.
pd_data[pd_data_num] = pd_data[pd_data_num].astype(float)
Este cambio estandariza las variables numéricas, permitiendo la aplicación uniforme de operaciones estadísticas y matemáticas.
Las variables booleanas FLAG_OWN_CAR, FLAG_OWN_REALTY y EMERGENCYSTATE_MODE, contienen valores de texto ("Yes/No", "Y/N"). Para simplificar su análisis y modelado, convertimos estas variables a formato binario (1 para 'YES/Y' y 0 para 'NO/N').
pd_data['FLAG_OWN_CAR'] = pd_data['FLAG_OWN_CAR'].apply(lambda x: 1 if x == 'Y' else 0)
pd_data['FLAG_OWN_REALTY'] = pd_data['FLAG_OWN_REALTY'].apply(lambda x: 1 if x == 'Y' else 0)
pd_data['EMERGENCYSTATE_MODE'] = pd_data['EMERGENCYSTATE_MODE'].apply(lambda x: 1 if x == 'Yes' else 0)
Además, cuando al final del notebook anterior (01), sustituimos los valores XNA por missing en la variable CODE_GENDER, se convirtió también en una variable booleana, por lo que vamos a asignar ceros al valor M (hombre) y unos al valor F (mujer).
pd_data['CODE_GENDER'] = pd_data['CODE_GENDER'].apply(lambda x: 1 if x == 'F' else 0)
Por otro lado, la variable NAME_CONTRACT_TYPE, que habíamos clasificado originalmente como booleana, la vamos a considerar como categórica, ya que sus posibles valores son los strings Cash loans y Revolving Loans.
pd_data_cat.append('NAME_CONTRACT_TYPE')
pd_data_bool.remove('NAME_CONTRACT_TYPE')
Separación en Train y Test estratificado¶
Antes de realizar la separación en conjuntos de Train y Test, es importante conocer la distribución de la variable objetivo (TARGET). Esto nos permitirá garantizar que se mantenga una proporción similar entre ambas clases en los dos subconjuntos.
data_target = pd_data['TARGET']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
data_target_conteo=pd_data['TARGET'].value_counts()
data_plot_target = pd.merge(data_target, data_target_conteo, on='TARGET')
data_plot_target
| TARGET | percent | count | |
|---|---|---|---|
| 0 | 0 | 91.927118 | 282686 |
| 1 | 1 | 8.072882 | 24825 |
fig, ax = plt.subplots()
sns.barplot(data=data_plot_target, x='TARGET', y='percent', ax=ax, hue='TARGET', legend=False)
ax.set_title('Conteo de valores de la variable Target', fontdict={'fontsize':15}) #,'fontname':'Montserrat','weight':'bold'
ax.set_ylabel('Percentage')
ax.set_xlabel('Value')
for i in ax.containers:
ax.bar_label(i,fmt='{:,.2f}')
plt.show()
Este gráfico de barras muestra la proporción de cada clase (0 y 1). Observamos que la clase 0 es dominante (91.93%) frente a la clase 1 (8.07%). Este desbalance puede influir en el rendimiento de los modelos y debe tenerse en cuenta a la hora de separar los datos.
Para evaluar correctamente el rendimiento de los modelos, dividimos los datos en dos conjuntos:
Train: Para entrenar el modelo.
Test: Para validar el rendimiento en datos no vistos. En este caso, utilizamos una separación estratificada para mantener la misma proporción de las clases (0 y 1) en ambos conjuntos
from sklearn.model_selection import train_test_split
X_data_train, X_data_test, y_data_train, y_data_test = train_test_split(pd_data.drop('TARGET',axis=1)
,pd_data['TARGET']
,stratify=pd_data['TARGET']
,test_size=0.2
,random_state=seed)
pd_data_train = pd.concat([X_data_train, y_data_train],axis=1)
pd_data_test = pd.concat([X_data_test, y_data_test],axis=1)
train_test_split divide los datos en proporción 80% (entrenamiento) y 20% (prueba). La opción stratify asegura que la proporción de las clases en TARGET sea consistente con la distribución que tiene en el dataset.
print('== Train\n', pd_data_train['TARGET'].value_counts(normalize=True))
print('== Test\n', pd_data_test['TARGET'].value_counts(normalize=True))
== Train TARGET 0 0.919271 1 0.080729 Name: proportion, dtype: float64 == Test TARGET 0 0.919272 1 0.080728 Name: proportion, dtype: float64
Visualización descriptiva de los datos¶
Distribución del resto de variables¶
Hacemos gráficas de la distribución de cada variable, y de su distribución condicionada al valor de 'TARGET', para ver si podemos extraer algunas conclusiones provisionales sobre qué variables son más importantes. Para ello hemos creado una función que permite graficar las variables de forma adecuada a su estructura.
warnings.filterwarnings('ignore')
for i in list(pd_data_train.columns):
if i in pd_data_num:
f.custom_plot(pd_data_train, col_name=i, is_cont=True, target='TARGET')
elif ((i in pd_data_bool) | (i in pd_data_cat)) & (i!='TARGET'):
f.custom_plot(pd_data_train, col_name=i, is_cont=False, target='TARGET')
Al analizar las variables visualmente, surgen algunos patrones interesantes. Algunas anotaciones clave basadas en los gráficos incluyen:
Relación entre ingresos y riesgo de incumplimiento: Se observa que el fraude o riesgo de incumplimiento (TARGET=1) parece correlacionarse con ingresos más bajos o intermedios (NAME_INCOME_TYPE y AMT_INCOME_TOTAL). Sin embargo, las distribuciones también reflejan la predominancia de categorías como "Working" y "Commercial associate", lo que podría sesgar los resultados.
Impacto de la ocupación y nivel educativo: Las categorías ocupacionales (OCCUPATION_TYPE) como "Low-skill laborers" muestran un porcentaje más alto de incumplimientos. Algo similar ocurre con niveles educativos más bajos (NAME_EDUCATION_TYPE), como "Secondary / secondary special", que presenta una mayor proporción de fraudes en comparación con niveles superiores. Esto refuerza la hipótesis de que las características socioeconómicas juegan un papel relevante en la predicción del riesgo.
Diferencias demográficas importantes: En las variables como DAYS_BIRTH y DAYS_EMPLOYED, hay indicios de que las personas más jóvenes tienden a tener una mayor proporción de TARGET=1, mientras que la estabilidad laboral prolongada está asociada a menores niveles de incumplimiento.
Regiones y movilidad: Variables como REG_CITY_NOT_WORK_CITY y LIVE_CITY_NOT_WORK_CITY muestran una mayor proporción de fraudes en clientes que residen o trabajan en diferentes ciudades. Esto puede reflejar riesgos asociados a movilidad geográfica o inestabilidad en el lugar de residencia.
Documentos presentados: Las variables relacionadas con FLAG_DOCUMENT no muestran una relación clara con el TARGET, aunque podrían indicar problemas en la documentación presentada por ciertos clientes. Esto podría ser más relevante en combinaciones con otras variables.
Regiones con mayor riesgo: Variables como REGION_POPULATION_RELATIVE muestran que las áreas menos densamente pobladas tienden a tener un mayor riesgo asociado. Sin embargo, las diferencias entre los valores de TARGET no son drásticas.
Riesgo y edad de la vivienda: Las variables relacionadas con la antigüedad de las propiedades (YEARS_BEGINEXPLUATATION_AVG) reflejan una ligera tendencia hacia un mayor riesgo de incumplimiento en propiedades más antiguas. Esto puede estar relacionado con la capacidad económica de los propietarios para mantener estas propiedades.
Desbalance de la variable objetivo: La variable objetivo está muy desbalanceada, con una proporción muy baja de TARGET=1. Esto sugiere que será necesario implementar técnicas de re-balanceo, como oversampling o ajustes específicos en el modelado.
Los gráficos revelan tendencias importantes que deben tenerse en cuenta en el preprocesamiento y modelado. Muchas variables parecen tener potencial predictivo, aunque la presencia de outliers, valores imputados y distribuciones altamente sesgadas requiere especial atención. Es importante profundizar en las interacciones entre variables clave y realizar pruebas estadísticas para confirmar estas hipótesis antes de proceder al modelado.
Tratamiento de las variables continuas¶
Tratamiento de outliers¶
Los valores outlier se pueden sustituir por la media, mediana, valores extremos (media+3std o media-3std). Tras el siguiente análisis, hemos decidido como primera iteración dejarlos sin sustituir. Una vez llegue al modelo puedo realizar iteraciones utilizando diferentes métodos para comprobar si mejora el modelo
f.get_deviation_of_mean_perc(pd_data_train, pd_data_num, target='TARGET', multiplier=3)
| 0.0 | 1.0 | variable | sum_outlier_values | porcentaje_sum_null_values | |
|---|---|---|---|---|---|
| 0 | 0.951237 | 0.048763 | COMMONAREA_AVG | 1374 | 0.005585 |
| 1 | 0.942808 | 0.057192 | NONLIVINGAPARTMENTS_AVG | 577 | 0.002345 |
| 2 | 0.948107 | 0.051893 | LIVINGAPARTMENTS_AVG | 1426 | 0.005797 |
| 3 | 0.957717 | 0.042283 | FLOORSMIN_AVG | 473 | 0.001923 |
| 4 | 0.923158 | 0.076842 | YEARS_BUILD_AVG | 950 | 0.003862 |
| 5 | 0.914980 | 0.085020 | OWN_CAR_AGE | 2717 | 0.011044 |
| 6 | 0.936272 | 0.063728 | LANDAREA_AVG | 1679 | 0.006825 |
| 7 | 0.945352 | 0.054648 | BASEMENTAREA_AVG | 1592 | 0.006471 |
| 8 | 0.947799 | 0.052201 | NONLIVINGAREA_AVG | 1954 | 0.007943 |
| 9 | 0.956456 | 0.043544 | ELEVATORS_AVG | 1975 | 0.008028 |
| 10 | 0.950495 | 0.049505 | APARTMENTS_AVG | 2424 | 0.009853 |
| 11 | 0.940090 | 0.059910 | ENTRANCES_AVG | 1786 | 0.007260 |
| 12 | 0.951437 | 0.048563 | LIVINGAREA_AVG | 2574 | 0.010463 |
| 13 | 0.956226 | 0.043774 | FLOORSMAX_AVG | 2056 | 0.008357 |
| 14 | 0.911817 | 0.088183 | YEARS_BEGINEXPLUATATION_AVG | 567 | 0.002305 |
| 15 | 0.955490 | 0.044510 | TOTALAREA_MODE | 2696 | 0.010959 |
| 16 | 0.961401 | 0.038599 | AMT_GOODS_PRICE | 3368 | 0.013691 |
| 17 | 0.962696 | 0.037304 | AMT_ANNUITY | 2359 | 0.009589 |
| 18 | 0.955426 | 0.044574 | DAYS_LAST_PHONE_CHANGE | 516 | 0.002097 |
| 19 | 0.943662 | 0.056338 | AMT_INCOME_TOTAL | 213 | 0.000866 |
| 20 | 0.959833 | 0.040167 | AMT_CREDIT | 2639 | 0.010727 |
| 21 | 0.964942 | 0.035058 | DAYS_REGISTRATION | 599 | 0.002435 |
| 22 | 0.959401 | 0.040599 | REGION_POPULATION_RELATIVE | 6749 | 0.027434 |
| 23 | 0.899189 | 0.100811 | CNT_FAM_MEMBERS | 3204 | 0.013024 |
| 24 | 0.898623 | 0.101377 | CNT_CHILDREN | 3413 | 0.013874 |
| 25 | 0.881130 | 0.118870 | DEF_30_CNT_SOCIAL_CIRCLE | 5485 | 0.022296 |
| 26 | 0.911427 | 0.088573 | OBS_60_CNT_SOCIAL_CIRCLE | 4787 | 0.019459 |
| 27 | 0.874041 | 0.125959 | DEF_60_CNT_SOCIAL_CIRCLE | 3128 | 0.012715 |
| 28 | 0.911490 | 0.088510 | OBS_30_CNT_SOCIAL_CIRCLE | 4926 | 0.020024 |
| 29 | 0.921172 | 0.078828 | AMT_REQ_CREDIT_BUREAU_WEEK | 6825 | 0.027743 |
| 30 | 0.925170 | 0.074830 | AMT_REQ_CREDIT_BUREAU_HOUR | 1323 | 0.005378 |
| 31 | 0.945904 | 0.054096 | AMT_REQ_CREDIT_BUREAU_MON | 2551 | 0.010370 |
| 32 | 0.910907 | 0.089093 | AMT_REQ_CREDIT_BUREAU_QRT | 1852 | 0.007528 |
| 33 | 0.904318 | 0.095682 | AMT_REQ_CREDIT_BUREAU_DAY | 1181 | 0.004801 |
| 34 | 0.908453 | 0.091547 | AMT_REQ_CREDIT_BUREAU_YEAR | 2709 | 0.011012 |
| 35 | 0.894523 | 0.105477 | HOUR_APPR_PROCESS_START | 493 | 0.002004 |
Al evaluar las variables continuas, se observa que existe un porcentaje significativo de valores atípicos en varias de ellas, lo que podría influir de manera importante en el comportamiento del modelo. A continuación, se destacan los puntos clave del análisis:
Presencia de Outliers en Variables Clave:
Las variables como COMMONAREA_AVG, NONLIVINGAPARTMENTS_AVG y LIVINGAPARTMENTS_AVG presentan valores atípicos en un rango cercano al 0.05% del total de registros. Aunque estos porcentajes pueden parecer bajos, podrían tener un impacto en variables con distribuciones no uniformes.
Variables relacionadas con características de vivienda como OWN_CAR_AGE, LANDAREA_AVG, y BASEMENTAREA_AVG también presentan niveles de outliers que superan el 0.01%. Estas variables son críticas, ya que podrían estar correlacionadas con la capacidad de pago del solicitante.
Impacto de Variables Socioeconómicas:
Variables como DAYS_EMPLOYED, DAYS_LAST_PHONE_CHANGE, y REGION_POPULATION_RELATIVE presentan outliers más pronunciados. Esto podría indicar que ciertos grupos poblacionales o personas en situaciones específicas tienen comportamientos atípicos en términos de empleo o cambios en contacto.
AMT_CREDIT y AMT_INCOME_TOTAL, relacionadas con la cantidad de crédito solicitado y el ingreso anual, contienen valores extremos mínimos pero significativos. Dada su relevancia en modelos de riesgo, su tratamiento cuidadoso es clave.
Análisis de Frecuencias Altas de Outliers:
- Las variables CNT_FAM_MEMBERS y CNT_CHILDREN, aunque no tienen un porcentaje muy elevado de valores atípicos, reflejan patrones en el tamaño de la familia y la cantidad de hijos, lo que podría indicar inconsistencias en ciertos grupos poblacionales específicos.
Por eso mismo no optamos por modificar estos valores en este modelo, ya que no parece que tengan una influencia muy significativa en nuestra variable objetivo. Pero insistimos de nuevo, en que se podrían realizar iteraciones utilizando diferentes métodos para comprobar si mejora el modelo.
Análisis de relaciones entre las variables¶
Vamos a evaluar las relaciones entre las variables numéricas del conjunto de datos mediante la matriz de correlaciones.
Correlation Matrix para variables numéricas¶
corr = pd.concat([pd_data_train[pd_data_num],pd_data_train['TARGET']], axis=1).corr(method='pearson')
corr
| COMMONAREA_AVG | NONLIVINGAPARTMENTS_AVG | LIVINGAPARTMENTS_AVG | FLOORSMIN_AVG | YEARS_BUILD_AVG | OWN_CAR_AGE | LANDAREA_AVG | BASEMENTAREA_AVG | EXT_SOURCE_1 | NONLIVINGAREA_AVG | ELEVATORS_AVG | APARTMENTS_AVG | ENTRANCES_AVG | LIVINGAREA_AVG | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_AVG | TOTALAREA_MODE | EXT_SOURCE_3 | EXT_SOURCE_2 | AMT_GOODS_PRICE | AMT_ANNUITY | DAYS_LAST_PHONE_CHANGE | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_ID_PUBLISH | DAYS_REGISTRATION | DAYS_EMPLOYED | DAYS_BIRTH | REGION_POPULATION_RELATIVE | CNT_FAM_MEMBERS | CNT_CHILDREN | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_YEAR | HOUR_APPR_PROCESS_START | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| COMMONAREA_AVG | 1.000000 | 0.100469 | 0.530912 | 0.294396 | 0.230494 | -0.036155 | 0.258204 | 0.403875 | 0.039327 | 0.222409 | 0.521230 | 0.536240 | 0.327349 | 0.544130 | 0.402248 | 0.086884 | 0.551081 | -0.001335 | 0.053072 | 0.049718 | 0.055926 | -0.006807 | 0.086892 | 0.048513 | -0.000351 | 0.024987 | -0.010976 | 0.004699 | 0.165302 | -0.000420 | -0.001948 | -0.011888 | -0.019063 | -0.013479 | -0.019489 | -0.008359 | 0.007218 | 0.024211 | -0.009351 | 0.005283 | -0.009823 | 0.044369 | -0.019782 |
| NONLIVINGAPARTMENTS_AVG | 0.100469 | 1.000000 | 0.155995 | 0.077764 | 0.071242 | -0.025476 | 0.064150 | 0.092110 | 0.016657 | 0.218461 | 0.123018 | 0.193630 | 0.063206 | 0.135749 | 0.113698 | 0.034691 | 0.143505 | 0.006790 | 0.019464 | 0.012403 | 0.019505 | 0.000306 | 0.025382 | 0.011495 | -0.003616 | 0.033233 | -0.002648 | 0.000238 | 0.023006 | 0.002642 | 0.003799 | 0.002419 | -0.002723 | 0.000265 | -0.003077 | -0.004075 | 0.000093 | 0.001647 | 0.003089 | 0.000649 | 0.000197 | 0.014044 | -0.004158 |
| LIVINGAPARTMENTS_AVG | 0.530912 | 0.155995 | 1.000000 | 0.435703 | 0.331857 | -0.045788 | 0.417132 | 0.647632 | 0.050335 | 0.288581 | 0.811352 | 0.943963 | 0.569113 | 0.879194 | 0.587505 | 0.142244 | 0.848336 | 0.004743 | 0.079635 | 0.060266 | 0.074095 | -0.005429 | 0.104597 | 0.057289 | -0.000653 | 0.021963 | -0.021713 | 0.009600 | 0.195264 | -0.006695 | -0.008962 | -0.015144 | -0.024301 | -0.016299 | -0.024605 | -0.004948 | 0.002999 | 0.034519 | -0.007409 | 0.008109 | -0.012161 | 0.076559 | -0.024013 |
| FLOORSMIN_AVG | 0.294396 | 0.077764 | 0.435703 | 1.000000 | 0.358867 | -0.074004 | 0.145948 | 0.219464 | 0.077853 | 0.150552 | 0.508580 | 0.441624 | 0.036137 | 0.461835 | 0.739743 | 0.162768 | 0.450510 | 0.002468 | 0.110622 | 0.077715 | 0.096028 | -0.008299 | 0.136660 | 0.075332 | -0.007402 | 0.019105 | -0.015649 | -0.000523 | 0.294473 | -0.003171 | -0.010453 | -0.020188 | -0.029920 | -0.019772 | -0.030394 | -0.000438 | 0.005449 | 0.041122 | -0.005084 | 0.005512 | -0.007285 | 0.112511 | -0.034858 |
| YEARS_BUILD_AVG | 0.230494 | 0.071242 | 0.331857 | 0.358867 | 1.000000 | -0.048208 | 0.180048 | 0.248387 | 0.020383 | 0.128014 | 0.342473 | 0.340212 | 0.089555 | 0.354994 | 0.519770 | 0.477032 | 0.358968 | 0.016258 | 0.012544 | 0.041372 | 0.033022 | 0.011331 | 0.041509 | 0.035006 | -0.007431 | 0.163101 | -0.003792 | 0.025962 | -0.053781 | 0.041806 | 0.031144 | -0.010540 | -0.000594 | -0.011541 | -0.000444 | -0.003822 | 0.001431 | -0.000726 | -0.006938 | 0.002743 | -0.024971 | -0.013931 | -0.020534 |
| OWN_CAR_AGE | -0.036155 | -0.025476 | -0.045788 | -0.074004 | -0.048208 | 1.000000 | -0.017769 | -0.028407 | -0.086686 | -0.029857 | -0.063733 | -0.047589 | -0.014242 | -0.055211 | -0.082431 | -0.000139 | -0.057055 | -0.013074 | -0.079392 | -0.104878 | -0.099073 | 0.004009 | -0.123613 | -0.095198 | 0.008095 | -0.025635 | 0.027046 | 0.007749 | -0.080985 | -0.013092 | 0.010799 | 0.007258 | 0.003352 | 0.012106 | 0.003248 | 0.002613 | -0.005774 | -0.021725 | -0.018056 | -0.008969 | -0.015801 | -0.069669 | 0.037821 |
| LANDAREA_AVG | 0.258204 | 0.064150 | 0.417132 | 0.145948 | 0.180048 | -0.017769 | 1.000000 | 0.462161 | 0.008756 | 0.164469 | 0.370073 | 0.491916 | 0.504060 | 0.495390 | 0.216667 | 0.073174 | 0.485942 | 0.013631 | 0.021915 | 0.013292 | 0.009182 | -0.003278 | -0.002688 | 0.006431 | -0.008346 | 0.001815 | -0.009698 | 0.001475 | -0.054471 | 0.000903 | -0.003293 | -0.003829 | -0.002923 | -0.003336 | -0.002967 | 0.011277 | -0.003032 | 0.016973 | 0.008446 | 0.007372 | -0.013149 | 0.013557 | -0.011140 |
| BASEMENTAREA_AVG | 0.403875 | 0.092110 | 0.647632 | 0.219464 | 0.248387 | -0.028407 | 0.462161 | 1.000000 | 0.042072 | 0.267067 | 0.564962 | 0.679892 | 0.653790 | 0.693061 | 0.329318 | 0.086155 | 0.673985 | 0.011103 | 0.048498 | 0.042163 | 0.043311 | -0.009265 | 0.014944 | 0.037382 | -0.015720 | -0.020542 | -0.001156 | -0.005428 | 0.096177 | -0.001777 | -0.007473 | -0.011970 | -0.012988 | -0.013713 | -0.013220 | -0.003813 | -0.002100 | 0.023288 | -0.001170 | 0.003733 | -0.013529 | 0.038674 | -0.020987 |
| EXT_SOURCE_1 | 0.039327 | 0.016657 | 0.050335 | 0.077853 | 0.020383 | -0.086686 | 0.008756 | 0.042072 | 1.000000 | 0.035329 | 0.075366 | 0.056377 | 0.023855 | 0.071064 | 0.095811 | 0.004505 | 0.070429 | 0.190412 | 0.214980 | 0.176111 | 0.121035 | -0.130435 | 0.023450 | 0.168960 | -0.132497 | -0.181136 | 0.290146 | -0.601868 | 0.100125 | -0.097436 | -0.140413 | -0.028141 | -0.024224 | -0.029206 | -0.024711 | -0.004562 | -0.004717 | 0.031299 | -0.004093 | -0.005997 | 0.005880 | 0.033482 | -0.154130 |
| NONLIVINGAREA_AVG | 0.222409 | 0.218461 | 0.288581 | 0.150552 | 0.128014 | -0.029857 | 0.164469 | 0.267067 | 0.035329 | 1.000000 | 0.283100 | 0.298400 | 0.167951 | 0.301632 | 0.253485 | 0.009326 | 0.365069 | -0.002482 | 0.048660 | 0.038534 | 0.046847 | -0.001539 | 0.071301 | 0.034250 | 0.003967 | 0.048773 | -0.010866 | 0.005618 | 0.074816 | 0.000517 | -0.000126 | -0.011817 | -0.017122 | -0.011388 | -0.017211 | -0.005866 | 0.005633 | 0.015441 | -0.002197 | 0.004280 | -0.008970 | 0.045419 | -0.013034 |
| ELEVATORS_AVG | 0.521230 | 0.123018 | 0.811352 | 0.508580 | 0.342473 | -0.063733 | 0.370073 | 0.564962 | 0.075366 | 0.283100 | 1.000000 | 0.836896 | 0.404818 | 0.867172 | 0.679242 | 0.078577 | 0.844945 | 0.008947 | 0.116073 | 0.083747 | 0.101445 | -0.011358 | 0.040543 | 0.080827 | -0.009157 | -0.000804 | -0.011120 | -0.002623 | 0.280606 | -0.002713 | -0.007769 | -0.021126 | -0.030973 | -0.021989 | -0.031313 | 0.000221 | 0.000728 | 0.046409 | -0.004969 | 0.004896 | -0.019256 | 0.103232 | -0.033765 |
| APARTMENTS_AVG | 0.536240 | 0.193630 | 0.943963 | 0.441624 | 0.340212 | -0.047589 | 0.491916 | 0.679892 | 0.056377 | 0.298400 | 0.836896 | 1.000000 | 0.613986 | 0.913874 | 0.618257 | 0.100419 | 0.893720 | 0.006739 | 0.090627 | 0.063743 | 0.076203 | -0.010119 | 0.031020 | 0.059531 | -0.007416 | 0.012643 | -0.017215 | 0.003071 | 0.206095 | -0.011798 | -0.013420 | -0.012649 | -0.022997 | -0.015079 | -0.023422 | 0.000008 | 0.002606 | 0.037665 | -0.002925 | 0.005730 | -0.017085 | 0.081627 | -0.027657 |
| ENTRANCES_AVG | 0.327349 | 0.063206 | 0.569113 | 0.036137 | 0.089555 | -0.014242 | 0.504060 | 0.653790 | 0.023855 | 0.167951 | 0.404818 | 0.613986 | 1.000000 | 0.620037 | 0.088954 | 0.041408 | 0.597438 | 0.013878 | 0.032956 | 0.019652 | 0.016963 | -0.013513 | 0.004916 | 0.015880 | -0.016777 | -0.063743 | 0.004242 | -0.012474 | 0.035945 | -0.002363 | -0.007480 | -0.002820 | -0.000642 | -0.005582 | -0.000822 | 0.000404 | -0.005264 | 0.013637 | 0.000560 | 0.006350 | -0.010672 | 0.020854 | -0.018575 |
| LIVINGAREA_AVG | 0.544130 | 0.135749 | 0.879194 | 0.461835 | 0.354994 | -0.055211 | 0.495390 | 0.693061 | 0.071064 | 0.301632 | 0.867172 | 0.913874 | 0.620037 | 1.000000 | 0.630246 | 0.091201 | 0.925154 | 0.006750 | 0.097277 | 0.076165 | 0.089473 | -0.011308 | 0.035539 | 0.071231 | -0.011285 | 0.006055 | -0.012711 | -0.002453 | 0.212895 | -0.005091 | -0.010141 | -0.015987 | -0.024911 | -0.017872 | -0.025228 | -0.001677 | 0.004844 | 0.037872 | -0.002515 | 0.011179 | -0.019517 | 0.082304 | -0.032207 |
| FLOORSMAX_AVG | 0.402248 | 0.113698 | 0.587505 | 0.739743 | 0.519770 | -0.082431 | 0.216667 | 0.329318 | 0.095811 | 0.253485 | 0.679242 | 0.618257 | 0.088954 | 0.630246 | 1.000000 | 0.126299 | 0.632131 | 0.003654 | 0.135747 | 0.107791 | 0.128867 | -0.007626 | 0.054546 | 0.102533 | -0.009191 | 0.049463 | -0.015921 | 0.000856 | 0.323741 | -0.004221 | -0.010567 | -0.026768 | -0.035389 | -0.027018 | -0.035684 | -0.000060 | 0.005211 | 0.049910 | -0.000972 | 0.004846 | -0.020153 | 0.117631 | -0.042450 |
| YEARS_BEGINEXPLUATATION_AVG | 0.086884 | 0.034691 | 0.142244 | 0.162768 | 0.477032 | -0.000139 | 0.073174 | 0.086155 | 0.004505 | 0.009326 | 0.078577 | 0.100419 | 0.041408 | 0.091201 | 0.126299 | 1.000000 | 0.100667 | 0.001845 | 0.009301 | 0.009117 | 0.015332 | 0.002849 | 0.005454 | 0.008278 | -0.004066 | 0.007685 | 0.008228 | 0.000129 | -0.006471 | 0.006488 | 0.005476 | -0.005378 | 0.000036 | -0.005525 | 0.000165 | 0.005772 | 0.001499 | -0.002404 | 0.002633 | -0.002304 | -0.008381 | -0.009800 | -0.009794 |
| TOTALAREA_MODE | 0.551081 | 0.143505 | 0.848336 | 0.450510 | 0.358968 | -0.057055 | 0.485942 | 0.673985 | 0.070429 | 0.365069 | 0.844945 | 0.893720 | 0.597438 | 0.925154 | 0.632131 | 0.100667 | 1.000000 | 0.007334 | 0.094617 | 0.076349 | 0.089925 | -0.009732 | 0.037218 | 0.071736 | -0.010882 | 0.017816 | -0.015775 | -0.000427 | 0.201603 | -0.003306 | -0.008657 | -0.016029 | -0.024667 | -0.017922 | -0.024990 | -0.001454 | 0.006151 | 0.037354 | -0.003248 | 0.012101 | -0.019690 | 0.077278 | -0.031773 |
| EXT_SOURCE_3 | -0.001335 | 0.006790 | 0.004743 | 0.002468 | 0.016258 | -0.013074 | 0.013631 | 0.011103 | 0.190412 | -0.002482 | 0.008947 | 0.006739 | 0.013878 | 0.006750 | 0.003654 | 0.001845 | 0.007334 | 1.000000 | 0.110273 | 0.047787 | 0.030505 | -0.074038 | -0.027986 | 0.043289 | -0.130886 | -0.107668 | 0.112850 | -0.204777 | -0.006984 | -0.027076 | -0.042484 | -0.035200 | -0.000293 | -0.034077 | 0.000059 | -0.022977 | -0.000365 | -0.006640 | -0.026066 | -0.006789 | -0.070741 | -0.039922 | -0.177583 |
| EXT_SOURCE_2 | 0.053072 | 0.019464 | 0.079635 | 0.110622 | 0.012544 | -0.079392 | 0.021915 | 0.048498 | 0.214980 | 0.048660 | 0.116073 | 0.090627 | 0.032956 | 0.097277 | 0.135747 | 0.009301 | 0.094617 | 0.110273 | 1.000000 | 0.140033 | 0.126214 | -0.196768 | 0.055071 | 0.131612 | -0.051859 | -0.060154 | -0.019304 | -0.093234 | 0.198612 | -0.002425 | -0.019035 | -0.031039 | -0.018631 | -0.032968 | -0.018944 | 0.001939 | -0.002964 | 0.051317 | -0.002802 | 0.000718 | -0.022717 | 0.156753 | -0.160270 |
| AMT_GOODS_PRICE | 0.049718 | 0.012403 | 0.060266 | 0.077715 | 0.041372 | -0.104878 | 0.013292 | 0.042163 | 0.176111 | 0.038534 | 0.083747 | 0.063743 | 0.019652 | 0.076165 | 0.107791 | 0.009117 | 0.076349 | 0.047787 | 0.140033 | 1.000000 | 0.775592 | -0.076783 | 0.147060 | 0.986965 | -0.010654 | 0.010919 | -0.063384 | -0.053341 | 0.103730 | 0.060779 | -0.001518 | -0.022161 | 0.001343 | -0.025512 | 0.001323 | -0.002725 | -0.003477 | 0.057638 | 0.014847 | 0.004494 | -0.051418 | 0.062376 | -0.039155 |
| AMT_ANNUITY | 0.055926 | 0.019505 | 0.074095 | 0.096028 | 0.033022 | -0.099073 | 0.009182 | 0.043311 | 0.121035 | 0.046847 | 0.101445 | 0.076203 | 0.016963 | 0.089473 | 0.128867 | 0.015332 | 0.089925 | 0.030505 | 0.126214 | 0.775592 | 1.000000 | -0.063913 | 0.176525 | 0.770470 | 0.010050 | 0.037742 | -0.103950 | 0.009687 | 0.118301 | 0.075295 | 0.021193 | -0.022550 | -0.010578 | -0.024532 | -0.010797 | 0.013609 | 0.002561 | 0.038942 | 0.008759 | 0.001914 | -0.012079 | 0.052745 | -0.012435 |
| DAYS_LAST_PHONE_CHANGE | -0.006807 | 0.000306 | -0.005429 | -0.008299 | 0.011331 | 0.004009 | -0.003278 | -0.009265 | -0.130435 | -0.001539 | -0.011358 | -0.010119 | -0.013513 | -0.011308 | -0.007626 | 0.002849 | -0.009732 | -0.074038 | -0.196768 | -0.076783 | -0.063913 | 1.000000 | -0.016266 | -0.073897 | 0.088123 | 0.057657 | 0.023036 | 0.082290 | -0.045421 | -0.027036 | -0.006144 | 0.000704 | -0.013185 | 0.002622 | -0.012751 | -0.003836 | -0.002439 | -0.039937 | -0.000606 | 0.000911 | -0.111318 | -0.016934 | 0.055109 |
| AMT_INCOME_TOTAL | 0.086892 | 0.025382 | 0.104597 | 0.136660 | 0.041509 | -0.123613 | -0.002688 | 0.014944 | 0.023450 | 0.071301 | 0.040543 | 0.031020 | 0.004916 | 0.035539 | 0.054546 | 0.005454 | 0.037218 | -0.027986 | 0.055071 | 0.147060 | 0.176525 | -0.016266 | 1.000000 | 0.144556 | 0.007503 | 0.025130 | -0.059076 | 0.025740 | 0.068677 | 0.014148 | 0.010854 | -0.011837 | -0.011630 | -0.011894 | -0.011713 | 0.002176 | 0.000592 | 0.022262 | 0.004380 | 0.003116 | 0.010377 | 0.033412 | -0.002190 |
| AMT_CREDIT | 0.048513 | 0.011495 | 0.057289 | 0.075332 | 0.035006 | -0.095198 | 0.006431 | 0.037382 | 0.168960 | 0.034250 | 0.080827 | 0.059531 | 0.015880 | 0.071231 | 0.102533 | 0.008278 | 0.071736 | 0.043289 | 0.131612 | 0.986965 | 0.770470 | -0.073897 | 0.144556 | 1.000000 | -0.008092 | 0.009058 | -0.065374 | -0.055325 | 0.099843 | 0.062696 | 0.002377 | -0.021113 | 0.001149 | -0.024801 | 0.001099 | -0.002886 | -0.004365 | 0.055544 | 0.014357 | 0.004074 | -0.048971 | 0.052911 | -0.029698 |
| DAYS_ID_PUBLISH | -0.000351 | -0.003616 | -0.000653 | -0.007402 | -0.007431 | 0.008095 | -0.008346 | -0.015720 | -0.132497 | 0.003967 | -0.009157 | -0.007416 | -0.016777 | -0.011285 | -0.009191 | -0.004066 | -0.010882 | -0.130886 | -0.051859 | -0.010654 | 0.010050 | 0.088123 | 0.007503 | -0.008092 | 1.000000 | 0.101686 | -0.272392 | 0.272716 | -0.003389 | -0.020354 | -0.027310 | 0.004245 | -0.011316 | 0.005330 | -0.010896 | -0.001416 | 0.003361 | -0.010847 | -0.004276 | -0.001228 | -0.034659 | 0.030861 | 0.051954 |
| DAYS_REGISTRATION | 0.024987 | 0.033233 | 0.021963 | 0.019105 | 0.163101 | -0.025635 | 0.001815 | -0.020542 | -0.181136 | 0.048773 | -0.000804 | 0.012643 | -0.063743 | 0.006055 | 0.049463 | 0.007685 | 0.017816 | -0.107668 | -0.060154 | 0.010919 | 0.037742 | 0.057657 | 0.025130 | 0.009058 | 0.101686 | 1.000000 | -0.209892 | 0.330983 | -0.054004 | 0.173308 | 0.183413 | 0.002167 | 0.007957 | 0.004519 | 0.007935 | 0.001060 | -0.002593 | -0.011527 | -0.001138 | 0.001442 | -0.023933 | -0.010449 | 0.043475 |
| DAYS_EMPLOYED | -0.010976 | -0.002648 | -0.021713 | -0.015649 | -0.003792 | 0.027046 | -0.009698 | -0.001156 | 0.290146 | -0.010866 | -0.011120 | -0.017215 | 0.004242 | -0.012711 | -0.015921 | 0.008228 | -0.015775 | 0.112850 | -0.019304 | -0.063384 | -0.103950 | 0.023036 | -0.059076 | -0.065374 | -0.272392 | -0.209892 | 1.000000 | -0.616611 | -0.004936 | -0.232925 | -0.240373 | 0.016665 | 0.006509 | 0.014412 | 0.006391 | 0.003387 | -0.003885 | -0.033226 | 0.012968 | -0.000304 | 0.050394 | -0.092061 | -0.043668 |
| DAYS_BIRTH | 0.004699 | 0.000238 | 0.009600 | -0.000523 | 0.025962 | 0.007749 | 0.001475 | -0.005428 | -0.601868 | 0.005618 | -0.002623 | 0.003071 | -0.012474 | -0.002453 | 0.000856 | 0.000129 | -0.000427 | -0.204777 | -0.093234 | -0.053341 | 0.009687 | 0.082290 | 0.025740 | -0.055325 | 0.272716 | 0.330983 | -0.616611 | 1.000000 | -0.030070 | 0.278701 | 0.331193 | 0.000727 | 0.005357 | 0.002809 | 0.005732 | -0.000687 | 0.003848 | 0.001379 | -0.009455 | 0.002456 | -0.070873 | 0.091661 | 0.077406 |
| REGION_POPULATION_RELATIVE | 0.165302 | 0.023006 | 0.195264 | 0.294473 | -0.053781 | -0.080985 | -0.054471 | 0.096177 | 0.100125 | 0.074816 | 0.280606 | 0.206095 | 0.035945 | 0.212895 | 0.323741 | -0.006471 | 0.201603 | -0.006984 | 0.198612 | 0.103730 | 0.118301 | -0.045421 | 0.068677 | 0.099843 | -0.003389 | -0.054004 | -0.004936 | -0.030070 | 1.000000 | -0.024800 | -0.026291 | 0.007131 | -0.011246 | 0.002973 | -0.011759 | -0.001682 | -0.001431 | 0.078221 | -0.001758 | 0.001420 | 0.000782 | 0.170127 | -0.035827 |
| CNT_FAM_MEMBERS | -0.000420 | 0.002642 | -0.006695 | -0.003171 | 0.041806 | -0.013092 | 0.000903 | -0.001777 | -0.097436 | 0.000517 | -0.002713 | -0.011798 | -0.002363 | -0.005091 | -0.004221 | 0.006488 | -0.003306 | -0.027076 | -0.002425 | 0.060779 | 0.075295 | -0.027036 | 0.014148 | 0.062696 | -0.020354 | 0.173308 | -0.232925 | 0.278701 | -0.024800 | 1.000000 | 0.879056 | -0.004910 | 0.023685 | -0.006535 | 0.024053 | -0.000613 | 0.000281 | -0.009149 | -0.004341 | -0.002056 | -0.029068 | -0.011766 | 0.009687 |
| CNT_CHILDREN | -0.001948 | 0.003799 | -0.008962 | -0.010453 | 0.031144 | 0.010799 | -0.003293 | -0.007473 | -0.140413 | -0.000126 | -0.007769 | -0.013420 | -0.007480 | -0.010141 | -0.010567 | 0.005476 | -0.008657 | -0.042484 | -0.019035 | -0.001518 | 0.021193 | -0.006144 | 0.010854 | 0.002377 | -0.027310 | 0.183413 | -0.240373 | 0.331193 | -0.026291 | 0.879056 | 1.000000 | -0.002826 | 0.013709 | -0.002898 | 0.014110 | -0.001563 | -0.000801 | -0.011591 | -0.007221 | -0.000592 | -0.041864 | -0.006374 | 0.019716 |
| DEF_30_CNT_SOCIAL_CIRCLE | -0.011888 | 0.002419 | -0.015144 | -0.020188 | -0.010540 | 0.007258 | -0.003829 | -0.011970 | -0.028141 | -0.011817 | -0.021126 | -0.012649 | -0.002820 | -0.015987 | -0.026768 | -0.005378 | -0.016029 | -0.035200 | -0.031039 | -0.022161 | -0.022550 | 0.000704 | -0.011837 | -0.021113 | 0.004245 | 0.002167 | 0.016665 | 0.000727 | 0.007131 | -0.004910 | -0.002826 | 1.000000 | 0.336766 | 0.860355 | 0.334538 | -0.001304 | 0.001758 | -0.000441 | -0.001320 | -0.000924 | 0.018832 | -0.005510 | 0.031189 |
| OBS_60_CNT_SOCIAL_CIRCLE | -0.019063 | -0.002723 | -0.024301 | -0.029920 | -0.000594 | 0.003352 | -0.002923 | -0.012988 | -0.024224 | -0.017122 | -0.030973 | -0.022997 | -0.000642 | -0.024911 | -0.035389 | 0.000036 | -0.024667 | -0.000293 | -0.018631 | 0.001343 | -0.010578 | -0.013185 | -0.011630 | 0.001149 | -0.011316 | 0.007957 | 0.006509 | 0.005357 | -0.011246 | 0.023685 | 0.013709 | 0.336766 | 1.000000 | 0.259729 | 0.998516 | 0.000333 | 0.000296 | 0.000356 | 0.003885 | -0.002046 | 0.034196 | -0.008900 | 0.008350 |
| DEF_60_CNT_SOCIAL_CIRCLE | -0.013479 | 0.000265 | -0.016299 | -0.019772 | -0.011541 | 0.012106 | -0.003336 | -0.013713 | -0.029206 | -0.011388 | -0.021989 | -0.015079 | -0.005582 | -0.017872 | -0.027018 | -0.005525 | -0.017922 | -0.034077 | -0.032968 | -0.025512 | -0.024532 | 0.002622 | -0.011894 | -0.024801 | 0.005330 | 0.004519 | 0.014412 | 0.002809 | 0.002973 | -0.006535 | -0.002898 | 0.860355 | 0.259729 | 1.000000 | 0.257677 | -0.002330 | 0.000479 | -0.002529 | 0.000147 | -0.001830 | 0.016295 | -0.008482 | 0.030469 |
| OBS_30_CNT_SOCIAL_CIRCLE | -0.019489 | -0.003077 | -0.024605 | -0.030394 | -0.000444 | 0.003248 | -0.002967 | -0.013220 | -0.024711 | -0.017211 | -0.031313 | -0.023422 | -0.000822 | -0.025228 | -0.035684 | 0.000165 | -0.024990 | 0.000059 | -0.018944 | 0.001323 | -0.010797 | -0.012751 | -0.011713 | 0.001099 | -0.010896 | 0.007935 | 0.006391 | 0.005732 | -0.011759 | 0.024053 | 0.014110 | 0.334538 | 0.998516 | 0.257677 | 1.000000 | 0.000239 | 0.000426 | 0.000255 | 0.004027 | -0.002077 | 0.033832 | -0.009024 | 0.008436 |
| AMT_REQ_CREDIT_BUREAU_WEEK | -0.008359 | -0.004075 | -0.004948 | -0.000438 | -0.003822 | 0.002613 | 0.011277 | -0.003813 | -0.004562 | -0.005866 | 0.000221 | 0.000008 | 0.000404 | -0.001677 | -0.000060 | 0.005772 | -0.001454 | -0.022977 | 0.001939 | -0.002725 | 0.013609 | -0.003836 | 0.002176 | -0.002886 | -0.001416 | 0.001060 | 0.003387 | -0.000687 | -0.001682 | -0.000613 | -0.001563 | -0.001304 | 0.000333 | -0.002330 | 0.000239 | 1.000000 | 0.003606 | -0.013516 | -0.014174 | 0.216424 | 0.017445 | -0.002892 | 0.001596 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 0.007218 | 0.000093 | 0.002999 | 0.005449 | 0.001431 | -0.005774 | -0.003032 | -0.002100 | -0.004717 | 0.005633 | 0.000728 | 0.002606 | -0.005264 | 0.004844 | 0.005211 | 0.001499 | 0.006151 | -0.000365 | -0.002964 | -0.003477 | 0.002561 | -0.002439 | 0.000592 | -0.004365 | 0.003361 | -0.002593 | -0.003885 | 0.003848 | -0.001431 | 0.000281 | -0.000801 | 0.001758 | 0.000296 | 0.000479 | 0.000426 | 0.003606 | 1.000000 | 0.000012 | -0.002151 | 0.237863 | -0.004563 | -0.016420 | -0.000368 |
| AMT_REQ_CREDIT_BUREAU_MON | 0.024211 | 0.001647 | 0.034519 | 0.041122 | -0.000726 | -0.021725 | 0.016973 | 0.023288 | 0.031299 | 0.015441 | 0.046409 | 0.037665 | 0.013637 | 0.037872 | 0.049910 | -0.002404 | 0.037354 | -0.006640 | 0.051317 | 0.057638 | 0.038942 | -0.039937 | 0.022262 | 0.055544 | -0.010847 | -0.011527 | -0.033226 | 0.001379 | 0.078221 | -0.009149 | -0.011591 | -0.000441 | 0.000356 | -0.002529 | 0.000255 | -0.013516 | 0.000012 | 1.000000 | -0.008448 | -0.005345 | -0.004582 | 0.037578 | -0.012187 |
| AMT_REQ_CREDIT_BUREAU_QRT | -0.009351 | 0.003089 | -0.007409 | -0.005084 | -0.006938 | -0.018056 | 0.008446 | -0.001170 | -0.004093 | -0.002197 | -0.004969 | -0.002925 | 0.000560 | -0.002515 | -0.000972 | 0.002633 | -0.003248 | -0.026066 | -0.002802 | 0.014847 | 0.008759 | -0.000606 | 0.004380 | 0.014357 | -0.004276 | -0.001138 | 0.012968 | -0.009455 | -0.001758 | -0.004341 | -0.007221 | -0.001320 | 0.003885 | 0.000147 | 0.004027 | -0.014174 | -0.002151 | -0.008448 | 1.000000 | -0.003376 | 0.071375 | 0.000488 | -0.002705 |
| AMT_REQ_CREDIT_BUREAU_DAY | 0.005283 | 0.000649 | 0.008109 | 0.005512 | 0.002743 | -0.008969 | 0.007372 | 0.003733 | -0.005997 | 0.004280 | 0.004896 | 0.005730 | 0.006350 | 0.011179 | 0.004846 | -0.002304 | 0.012101 | -0.006789 | 0.000718 | 0.004494 | 0.001914 | 0.000911 | 0.003116 | 0.004074 | -0.001228 | 0.001442 | -0.000304 | 0.002456 | 0.001420 | -0.002056 | -0.000592 | -0.000924 | -0.002046 | -0.001830 | -0.002077 | 0.216424 | 0.237863 | -0.005345 | -0.003376 | 1.000000 | -0.002481 | 0.000177 | 0.003584 |
| AMT_REQ_CREDIT_BUREAU_YEAR | -0.009823 | 0.000197 | -0.012161 | -0.007285 | -0.024971 | -0.015801 | -0.013149 | -0.013529 | 0.005880 | -0.008970 | -0.019256 | -0.017085 | -0.010672 | -0.019517 | -0.020153 | -0.008381 | -0.019690 | -0.070741 | -0.022717 | -0.051418 | -0.012079 | -0.111318 | 0.010377 | -0.048971 | -0.034659 | -0.023933 | 0.050394 | -0.070873 | 0.000782 | -0.029068 | -0.041864 | 0.018832 | 0.034196 | 0.016295 | 0.033832 | 0.017445 | -0.004563 | -0.004582 | 0.071375 | -0.002481 | 1.000000 | -0.030087 | 0.017556 |
| HOUR_APPR_PROCESS_START | 0.044369 | 0.014044 | 0.076559 | 0.112511 | -0.013931 | -0.069669 | 0.013557 | 0.038674 | 0.033482 | 0.045419 | 0.103232 | 0.081627 | 0.020854 | 0.082304 | 0.117631 | -0.009800 | 0.077278 | -0.039922 | 0.156753 | 0.062376 | 0.052745 | -0.016934 | 0.033412 | 0.052911 | 0.030861 | -0.010449 | -0.092061 | 0.091661 | 0.170127 | -0.011766 | -0.006374 | -0.005510 | -0.008900 | -0.008482 | -0.009024 | -0.002892 | -0.016420 | 0.037578 | 0.000488 | 0.000177 | -0.030087 | 1.000000 | -0.022306 |
| TARGET | -0.019782 | -0.004158 | -0.024013 | -0.034858 | -0.020534 | 0.037821 | -0.011140 | -0.020987 | -0.154130 | -0.013034 | -0.033765 | -0.027657 | -0.018575 | -0.032207 | -0.042450 | -0.009794 | -0.031773 | -0.177583 | -0.160270 | -0.039155 | -0.012435 | 0.055109 | -0.002190 | -0.029698 | 0.051954 | 0.043475 | -0.043668 | 0.077406 | -0.035827 | 0.009687 | 0.019716 | 0.031189 | 0.008350 | 0.030469 | 0.008436 | 0.001596 | -0.000368 | -0.012187 | -0.002705 | 0.003584 | 0.017556 | -0.022306 | 1.000000 |
A través de esta tabla podemos observar la correlación entre estas variables. Las variables como LIVINGAPARTMENTS_AVG y LIVINGAREA_AVG muestran correlaciones altas, indicando posible redundancia entre estas variables. Pero para que sea más visual, vamos a ilustrar gráficamente las correlaciones
f.plot_correlation_heatmap(corr)
(<Figure size 1400x1200 with 2 Axes>,
<Axes: title={'center': 'Matriz de correlaciones'}>)
En algunos algoritmos, como los ensembling de árboles (xgboost, randomforest, lightgbm, catboost, ...) no es necesario eliminar colinealidad entre variables. En otros algoritmos como glm si es necesario eliminar la colinealidad, por lo que mostramos aquí una lista de las variables más colineales, en caso de que fuera necesario tenerla en cuenta.
corr = pd_data_train[pd_data_num].corr('pearson')
new_corr = corr.abs()
new_corr.loc[:,:] = np.tril(new_corr, k=-1) # below main lower triangle of an array
new_corr = new_corr.stack().to_frame('correlation').reset_index().sort_values(by='correlation', ascending=False)
new_corr[new_corr['correlation']>0.6]
| level_0 | level_1 | correlation | |
|---|---|---|---|
| 1460 | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | 0.998516 |
| 985 | AMT_CREDIT | AMT_GOODS_PRICE | 0.986965 |
| 464 | APARTMENTS_AVG | LIVINGAPARTMENTS_AVG | 0.943963 |
| 685 | TOTALAREA_MODE | LIVINGAREA_AVG | 0.925154 |
| 557 | LIVINGAREA_AVG | APARTMENTS_AVG | 0.913874 |
| 683 | TOTALAREA_MODE | APARTMENTS_AVG | 0.893720 |
| 548 | LIVINGAREA_AVG | LIVINGAPARTMENTS_AVG | 0.879194 |
| 1289 | CNT_CHILDREN | CNT_FAM_MEMBERS | 0.879056 |
| 556 | LIVINGAREA_AVG | ELEVATORS_AVG | 0.867172 |
| 1417 | DEF_60_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | 0.860355 |
| 674 | TOTALAREA_MODE | LIVINGAPARTMENTS_AVG | 0.848336 |
| 682 | TOTALAREA_MODE | ELEVATORS_AVG | 0.844945 |
| 472 | APARTMENTS_AVG | ELEVATORS_AVG | 0.836896 |
| 422 | ELEVATORS_AVG | LIVINGAPARTMENTS_AVG | 0.811352 |
| 859 | AMT_ANNUITY | AMT_GOODS_PRICE | 0.775592 |
| 986 | AMT_CREDIT | AMT_ANNUITY | 0.770470 |
| 591 | FLOORSMAX_AVG | FLOORSMIN_AVG | 0.739743 |
| 553 | LIVINGAREA_AVG | BASEMENTAREA_AVG | 0.693061 |
| 469 | APARTMENTS_AVG | BASEMENTAREA_AVG | 0.679892 |
| 598 | FLOORSMAX_AVG | ELEVATORS_AVG | 0.679242 |
| 679 | TOTALAREA_MODE | BASEMENTAREA_AVG | 0.673985 |
| 511 | ENTRANCES_AVG | BASEMENTAREA_AVG | 0.653790 |
| 296 | BASEMENTAREA_AVG | LIVINGAPARTMENTS_AVG | 0.647632 |
| 686 | TOTALAREA_MODE | FLOORSMAX_AVG | 0.632131 |
| 601 | FLOORSMAX_AVG | LIVINGAREA_AVG | 0.630246 |
| 558 | LIVINGAREA_AVG | ENTRANCES_AVG | 0.620037 |
| 599 | FLOORSMAX_AVG | APARTMENTS_AVG | 0.618257 |
| 1160 | DAYS_BIRTH | DAYS_EMPLOYED | 0.616611 |
| 515 | ENTRANCES_AVG | APARTMENTS_AVG | 0.613986 |
| 1142 | DAYS_BIRTH | EXT_SOURCE_1 | 0.601868 |
Convertimos los valores absolutos de la matriz de correlación en una lista de pares de variables, y filtramos las correlaciones mayores a 0.6 para identificar aquellas variables que podrían causar multicolinealidad.
En el análisis de correlación, identificamos varias variables con relaciones positivas muy fuertes, destacando su redundancia potencial. Por ejemplo, OBS_30_CNT_SOCIAL_CIRCLE y OBS_60_CNT_SOCIAL_CIRCLE tienen una correlación de 0.998, lo que indica que ambas miden prácticamente lo mismo sobre las conexiones sociales del cliente. Asimismo, AMT_CREDIT y AMT_GOODS_PRICE (0.987) reflejan que el monto del crédito solicitado está estrechamente vinculado al precio del bien adquirido. Por otro lado, variables relacionadas con propiedades, como APARTMENTS_AVG y LIVINGAPARTMENTS_AVG (0.944), o TOTALAREA_MODE y LIVINGAREA_AVG (0.925), describen características similares sobre las áreas habitables. Por último, CNT_CHILDREN y CNT_FAM_MEMBERS (0.879) resaltan cómo el número de hijos influye directamente en el tamaño de la familia.
Tratamiento de valores nulos¶
Vamos a realizar un análisis descriptivo para identificar qué tan incompletos están los datos, tanto a nivel de columnas como de filas. Esto es crucial para decidir qué acciones tomar en el preprocesamiento, como eliminar filas o columnas con demasiados valores nulos o aplicar estrategias de imputación.
¿Son todos los nulos de una clase de la variable objetivo? o tienen el mismo porcentaje de la variable objetivo?
pd_series_null_columns = pd_data.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_data.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
(93,) (307511,)
Vamos a generar dos series, una que resume los valores nulos por columna (pd_series_null_columns) y otra por fila (pd_series_null_rows). Las columnas y filas se ordenaron de mayor a menor cantidad de valores faltantes, permitiendo visualizar rápidamente qué variables son más incompletas.
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = pd_data['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_data.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_data.shape[1]
pd_null_columnas[pd_null_columnas['nulos_columnas'] > 0]
| nulos_columnas | porcentaje_columnas | |
|---|---|---|
| COMMONAREA_AVG | 214865 | 0.698723 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 0.694330 |
| FONDKAPREMONT_MODE | 210295 | 0.683862 |
| LIVINGAPARTMENTS_AVG | 210199 | 0.683550 |
| FLOORSMIN_AVG | 208642 | 0.678486 |
| YEARS_BUILD_AVG | 204488 | 0.664978 |
| OWN_CAR_AGE | 202929 | 0.659908 |
| LANDAREA_AVG | 182590 | 0.593767 |
| BASEMENTAREA_AVG | 179943 | 0.585160 |
| EXT_SOURCE_1 | 173378 | 0.563811 |
| NONLIVINGAREA_AVG | 169682 | 0.551792 |
| ELEVATORS_AVG | 163891 | 0.532960 |
| WALLSMATERIAL_MODE | 156341 | 0.508408 |
| APARTMENTS_AVG | 156061 | 0.507497 |
| ENTRANCES_AVG | 154828 | 0.503488 |
| LIVINGAREA_AVG | 154350 | 0.501933 |
| HOUSETYPE_MODE | 154297 | 0.501761 |
| FLOORSMAX_AVG | 153020 | 0.497608 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 0.487810 |
| TOTALAREA_MODE | 148431 | 0.482685 |
| OCCUPATION_TYPE | 96391 | 0.313455 |
| EXT_SOURCE_3 | 60965 | 0.198253 |
| ORGANIZATION_TYPE | 55374 | 0.180072 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 0.135016 |
| NAME_TYPE_SUITE | 1292 | 0.004201 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| EXT_SOURCE_2 | 660 | 0.002146 |
| AMT_GOODS_PRICE | 278 | 0.000904 |
| AMT_ANNUITY | 12 | 0.000039 |
| CODE_GENDER | 4 | 0.000013 |
| CNT_FAM_MEMBERS | 2 | 0.000007 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000003 |
Vamos a ver la distribución de los nulos con respecto a la variable objetivo, para ver si nos está aportando información relevante o los podemos sustituir. En este caso no eliminamos las filas con valores nulos, porque una gran cantidad de filas los contienen.
f.get_percent_null_values_target(pd_data_train, pd_data_num, target='TARGET')
| TARGET | 0.0 | 1.0 | variable | sum_null_values | porcentaje_sum_null_values |
|---|---|---|---|---|---|
| TARGET | 0.0 | 1.0 | |||
| 0 | 0.914195 | 0.085805 | COMMONAREA_AVG | 171669 | 0.697819 |
| 1 | 0.914040 | 0.085960 | NONLIVINGAPARTMENTS_AVG | 170615 | 0.693534 |
| 2 | 0.913672 | 0.086328 | LIVINGAPARTMENTS_AVG | 167940 | 0.682661 |
| 3 | 0.913639 | 0.086361 | FLOORSMIN_AVG | 166719 | 0.677697 |
| 4 | 0.913240 | 0.086760 | YEARS_BUILD_AVG | 163417 | 0.664275 |
| 5 | 0.915260 | 0.084740 | OWN_CAR_AGE | 162520 | 0.660629 |
| 6 | 0.911542 | 0.088458 | LANDAREA_AVG | 145810 | 0.592704 |
| 7 | 0.910606 | 0.089394 | BASEMENTAREA_AVG | 143679 | 0.584042 |
| 8 | 0.914834 | 0.085166 | EXT_SOURCE_1 | 138647 | 0.563587 |
| 9 | 0.909419 | 0.090581 | NONLIVINGAREA_AVG | 135581 | 0.551124 |
| 10 | 0.908887 | 0.091113 | ELEVATORS_AVG | 130881 | 0.532019 |
| 11 | 0.908144 | 0.091856 | APARTMENTS_AVG | 124597 | 0.506475 |
| 12 | 0.907972 | 0.092028 | ENTRANCES_AVG | 123680 | 0.502748 |
| 13 | 0.908218 | 0.091782 | LIVINGAREA_AVG | 123248 | 0.500992 |
| 14 | 0.907850 | 0.092150 | FLOORSMAX_AVG | 122224 | 0.496829 |
| 15 | 0.907652 | 0.092348 | YEARS_BEGINEXPLUATATION_AVG | 119808 | 0.487009 |
| 16 | 0.907417 | 0.092583 | TOTALAREA_MODE | 118542 | 0.481862 |
| 17 | 0.906160 | 0.093840 | EXT_SOURCE_3 | 48913 | 0.198827 |
| 18 | 0.918095 | 0.081905 | EXT_SOURCE_2 | 525 | 0.002134 |
| 19 | 0.922747 | 0.077253 | AMT_GOODS_PRICE | 233 | 0.000947 |
| 20 | 1.000000 | NaN | AMT_ANNUITY | 10 | 0.000041 |
| 21 | 1.000000 | NaN | DAYS_LAST_PHONE_CHANGE | 1 | 0.000004 |
| 22 | 1.000000 | NaN | CNT_FAM_MEMBERS | 2 | 0.000008 |
| 23 | 0.968637 | 0.031363 | DEF_30_CNT_SOCIAL_CIRCLE | 829 | 0.003370 |
| 24 | 0.968637 | 0.031363 | OBS_60_CNT_SOCIAL_CIRCLE | 829 | 0.003370 |
| 25 | 0.968637 | 0.031363 | DEF_60_CNT_SOCIAL_CIRCLE | 829 | 0.003370 |
| 26 | 0.968637 | 0.031363 | OBS_30_CNT_SOCIAL_CIRCLE | 829 | 0.003370 |
| 27 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_WEEK | 33340 | 0.135524 |
| 28 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_HOUR | 33340 | 0.135524 |
| 29 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_MON | 33340 | 0.135524 |
| 30 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_QRT | 33340 | 0.135524 |
| 31 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_DAY | 33340 | 0.135524 |
| 32 | 0.895681 | 0.104319 | AMT_REQ_CREDIT_BUREAU_YEAR | 33340 | 0.135524 |
Como los nulos en la mayoría de las variables se distribuyen de forma proporcional respecto a la variable objetivo (recordemos que era un 91.93% de ceros y un 8.07% de unos), y después de comprobar que la correlación se mantiene estable, hemos decidicido sustituir los valores missing por la mediana. Las únicas variables que se distribuyen de froma asimétrica son AMT_ANNUITY, DAYS_LAST_PHONE_CHANGE y CNT_FAM_MEMBERS, y como en comparación con el total de los datos son una cantidad insignificante de NA, tampoco afecta a la correlación con la variable target el hecho de sustituir por la mediana.
pd_data_train[pd_data_num] = pd_data_train[pd_data_num].apply(lambda x: x.fillna(x.median()))
pd_data_test[pd_data_num] = pd_data_test[pd_data_num].fillna(pd_data_train[pd_data_num].median())
Volvemos a aplicar la función para verificar que ya no hay valores nulos en las variables numéricas.
f.get_percent_null_values_target(pd_data_train, pd_data_num, target='TARGET')
No existen variables con valores nulos
Volvemos a visualizar la matriz de correlaciones para ver si se ha modificado tras la imputación de los valores nulos.
corr = pd.concat([pd_data_train[pd_data_num],pd_data_train['TARGET']], axis=1).corr(method='pearson')
f.plot_correlation_heatmap(corr)
(<Figure size 1400x1200 with 2 Axes>,
<Axes: title={'center': 'Matriz de correlaciones'}>)
corr.loc['TARGET'].sort_values(ascending=False)
TARGET 1.000000 DAYS_BIRTH 0.077406 DAYS_LAST_PHONE_CHANGE 0.055109 DAYS_ID_PUBLISH 0.051954 DAYS_REGISTRATION 0.043475 DEF_30_CNT_SOCIAL_CIRCLE 0.031356 DEF_60_CNT_SOCIAL_CIRCLE 0.030611 CNT_CHILDREN 0.019716 OWN_CAR_AGE 0.016371 AMT_REQ_CREDIT_BUREAU_YEAR 0.009758 CNT_FAM_MEMBERS 0.009688 OBS_30_CNT_SOCIAL_CIRCLE 0.008783 OBS_60_CNT_SOCIAL_CIRCLE 0.008697 AMT_REQ_CREDIT_BUREAU_DAY 0.002466 AMT_REQ_CREDIT_BUREAU_WEEK -0.000659 AMT_REQ_CREDIT_BUREAU_HOUR -0.001306 AMT_INCOME_TOTAL -0.002190 YEARS_BEGINEXPLUATATION_AVG -0.004636 AMT_REQ_CREDIT_BUREAU_QRT -0.006440 NONLIVINGAPARTMENTS_AVG -0.006499 YEARS_BUILD_AVG -0.010426 LANDAREA_AVG -0.012367 AMT_ANNUITY -0.012433 AMT_REQ_CREDIT_BUREAU_MON -0.014678 ENTRANCES_AVG -0.015639 BASEMENTAREA_AVG -0.016691 COMMONAREA_AVG -0.016894 NONLIVINGAREA_AVG -0.018123 LIVINGAPARTMENTS_AVG -0.018872 FLOORSMIN_AVG -0.021834 HOUR_APPR_PROCESS_START -0.022306 APARTMENTS_AVG -0.025724 LIVINGAREA_AVG -0.029194 AMT_CREDIT -0.029698 TOTALAREA_MODE -0.029837 ELEVATORS_AVG -0.035738 REGION_POPULATION_RELATIVE -0.035827 FLOORSMAX_AVG -0.038597 AMT_GOODS_PRICE -0.039133 DAYS_EMPLOYED -0.043668 EXT_SOURCE_1 -0.098159 EXT_SOURCE_3 -0.154409 EXT_SOURCE_2 -0.160081 Name: TARGET, dtype: float64
Las correlaciones encontradas son en su mayoría débiles, lo que sugiere que ninguna variable tiene un impacto muy fuerte sobre la variable objetivo por sí sola. Sin embargo, combinaciones de estas variables podrían proporcionar información más útil en modelos predictivos.
Tratamiento de las variables categoricas¶
Vamos a convertir las variables categoricas en numericas para después obtener la correlación gracias a la correlación de spearman y el coeficiente V-Cramer.
Se han procesado las variables booleanas y categóricas para calcular su correlación con la variable objetivo TARGET.
pd_data_train[pd_data_bool].head()
| EMERGENCYSTATE_MODE | CODE_GENDER | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | FLAG_OWN_REALTY | FLAG_OWN_CAR | FLAG_DOCUMENT_12 | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | FLAG_EMAIL | FLAG_PHONE | FLAG_CONT_MOBILE | FLAG_WORK_PHONE | FLAG_EMP_PHONE | FLAG_MOBIL | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||
| 172963 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 423589 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 392520 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 359127 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 431456 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
corr_bool = pd_data_train[pd_data_bool].corr(method='pearson')
f.plot_correlation_heatmap(corr_bool)
(<Figure size 1400x1200 with 2 Axes>,
<Axes: title={'center': 'Matriz de correlaciones'}>)
La matriz de correlación muestra correlaciones altas entre variables como REG_CITY_NOT_WORK_CITY y LIVE_CITY_NOT_WORK_CITY (0.9), lo que sugiere redundancia, ya que ambas representan relaciones similares entre residencia y trabajo. Sin embargo, las correlaciones con la variable objetivo (TARGET) son bajas.
Otro dato interesante que podemos extraer de la gráfica, aunque no necesariamente relevante respecto a la variable target, es que FLAG_DOCUMENT_6 presenta una correlación negativa significativa (-0.6) con FLAG_EMP_PHONE, por lo que probablemente este documento contenga datos personales del cliente, entre ellos su número de teléfono.
Tambien hemos procesado las variables categóricas (no booleanas), transformadas en formato numérico para calcular sus correlaciones, utilizando una función que itera la función cramers_v de los apuntes.
corr_cats = f.corr_cat(df=pd_data_train,target='TARGET',target_transform=True)
corr_cats
| WALLSMATERIAL_MODE | HOUSETYPE_MODE | OCCUPATION_TYPE | ORGANIZATION_TYPE | NAME_TYPE_SUITE | CODE_GENDER | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | NAME_INCOME_TYPE | REGION_RATING_CLIENT | NAME_HOUSING_TYPE | NAME_FAMILY_STATUS | NAME_EDUCATION_TYPE | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WALLSMATERIAL_MODE | 1.000000 | 0.106949 | 0.026653 | 0.030347 | 0.008646 | 0.015555 | 0.098011 | 0.003890 | 0.024058 | 0.105330 | 0.035866 | 0.008624 | 0.036321 | 0.027575 |
| HOUSETYPE_MODE | 0.106949 | 1.000000 | 0.022958 | 0.021435 | 0.000000 | 0.007973 | 0.019517 | 0.000000 | 0.016142 | 0.025076 | 0.016962 | 0.010240 | 0.016470 | 0.011329 |
| OCCUPATION_TYPE | 0.026653 | 0.022958 | 1.000000 | 0.301776 | 0.020917 | 0.572457 | 0.057319 | 0.019195 | 0.144951 | 0.057187 | 0.032623 | 0.058650 | 0.219109 | 0.080576 |
| ORGANIZATION_TYPE | 0.030347 | 0.021435 | 0.301776 | 1.000000 | 0.018956 | 0.323213 | 0.093888 | 0.019775 | 0.203516 | 0.094917 | 0.067417 | 0.049357 | 0.116218 | 0.056873 |
| NAME_TYPE_SUITE | 0.008646 | 0.000000 | 0.020917 | 0.018956 | 1.000000 | 0.063592 | 0.024091 | 0.016191 | 0.021888 | 0.024646 | 0.018756 | 0.067369 | 0.022795 | 0.009530 |
| CODE_GENDER | 0.015555 | 0.007973 | 0.572457 | 0.323213 | 0.063592 | 0.999991 | 0.020063 | 0.006377 | 0.170464 | 0.019900 | 0.066897 | 0.167912 | 0.026127 | 0.055378 |
| REGION_RATING_CLIENT_W_CITY | 0.098011 | 0.019517 | 0.057319 | 0.093888 | 0.024091 | 0.020063 | 1.000000 | 0.017709 | 0.130031 | 0.956621 | 0.086834 | 0.023566 | 0.071797 | 0.060857 |
| WEEKDAY_APPR_PROCESS_START | 0.003890 | 0.000000 | 0.019195 | 0.019775 | 0.016191 | 0.006377 | 0.017709 | 1.000000 | 0.013009 | 0.018754 | 0.004755 | 0.003939 | 0.005417 | 0.004307 |
| NAME_INCOME_TYPE | 0.024058 | 0.016142 | 0.144951 | 0.203516 | 0.021888 | 0.170464 | 0.130031 | 0.013009 | 1.000000 | 0.136369 | 0.054748 | 0.113153 | 0.104543 | 0.062831 |
| REGION_RATING_CLIENT | 0.105330 | 0.025076 | 0.057187 | 0.094917 | 0.024646 | 0.019900 | 0.956621 | 0.018754 | 0.136369 | 1.000000 | 0.087524 | 0.022782 | 0.071932 | 0.059122 |
| NAME_HOUSING_TYPE | 0.035866 | 0.016962 | 0.032623 | 0.067417 | 0.018756 | 0.066897 | 0.086834 | 0.004755 | 0.054748 | 0.087524 | 1.000000 | 0.067920 | 0.041843 | 0.036127 |
| NAME_FAMILY_STATUS | 0.008624 | 0.010240 | 0.058650 | 0.049357 | 0.067369 | 0.167912 | 0.023566 | 0.003939 | 0.113153 | 0.022782 | 0.067920 | 1.000000 | 0.052976 | 0.039689 |
| NAME_EDUCATION_TYPE | 0.036321 | 0.016470 | 0.219109 | 0.116218 | 0.022795 | 0.026127 | 0.071797 | 0.005417 | 0.104543 | 0.071932 | 0.041843 | 0.052976 | 1.000000 | 0.056687 |
| TARGET | 0.027575 | 0.011329 | 0.080576 | 0.056873 | 0.009530 | 0.055378 | 0.060857 | 0.004307 | 0.062831 | 0.059122 | 0.036127 | 0.039689 | 0.056687 | 0.999973 |
plt.figure(figsize=(8,4))
sns.heatmap(corr_cats, annot=True, fmt='.3f', cmap='YlOrRd', annot_kws={"size": 8})
plt.title('Cramers V Matrix', fontdict={'size':'17'})
plt.show()
Esta matriz de Cramér's V revela información interesante sobre las relaciones entre las variables categóricas y la variable objetivo TARGET en el dataset.
Se destacan correlaciones moderadas entre algunas variables, como CODE_GENDER y ORGANIZATION_TYPE (0.572) o REGION_RATING_CLIENT y REGION_RATING_CLIENT_W_CITY (0.957). Estas correlaciones sugieren que algunas variables comparten información o describen aspectos relacionados de los clientes, como su género o calificaciones asociadas a la región de residencia. Esto indica relaciones lógicas predecibles entre las variables.
Las correlaciones con la variable TARGET son en general bajas (menores a 0.1), lo que implica que estas variables categóricas, aunque relevantes, no tienen un impacto directo o individual significativo en el modelo.
Tratamiento de valores nulos¶
En las variables categóricas, los valores nulos se suelen sustituir por una nueva clase: "sin valor" o por la moda
pd_data_train[pd_data_bool].isna().sum()
EMERGENCYSTATE_MODE 0 CODE_GENDER 4 FLAG_DOCUMENT_10 0 FLAG_DOCUMENT_2 0 FLAG_DOCUMENT_3 0 FLAG_DOCUMENT_4 0 FLAG_DOCUMENT_5 0 FLAG_DOCUMENT_6 0 FLAG_DOCUMENT_7 0 FLAG_DOCUMENT_8 0 FLAG_DOCUMENT_9 0 FLAG_DOCUMENT_11 0 FLAG_DOCUMENT_13 0 FLAG_DOCUMENT_14 0 FLAG_DOCUMENT_15 0 FLAG_DOCUMENT_16 0 FLAG_DOCUMENT_17 0 FLAG_DOCUMENT_18 0 FLAG_DOCUMENT_19 0 FLAG_DOCUMENT_20 0 FLAG_DOCUMENT_21 0 FLAG_OWN_REALTY 0 FLAG_OWN_CAR 0 FLAG_DOCUMENT_12 0 REG_REGION_NOT_LIVE_REGION 0 REG_REGION_NOT_WORK_REGION 0 LIVE_REGION_NOT_WORK_REGION 0 REG_CITY_NOT_LIVE_CITY 0 REG_CITY_NOT_WORK_CITY 0 LIVE_CITY_NOT_WORK_CITY 0 FLAG_EMAIL 0 FLAG_PHONE 0 FLAG_CONT_MOBILE 0 FLAG_WORK_PHONE 0 FLAG_EMP_PHONE 0 FLAG_MOBIL 0 TARGET 0 dtype: int64
missing_gender = pd_data_train[pd_data_train['CODE_GENDER'].isnull()]
target_dist = missing_gender['TARGET']
print("Target distribution for missing code_gender:")
print(target_dist)
Target distribution for missing code_gender: SK_ID_CURR 144669 0 319880 0 141289 0 196708 0 Name: TARGET, dtype: int64
pd_data_train = pd_data_train.dropna(subset=['CODE_GENDER'])
Podemos observar que las variables booleanas, en general, no presentan valores nulos, con la excepción de la variable CODE_GENDER, que tiene únicamente 4 valores faltantes. Dado que este número es pequeño, se ha decidido eliminar las cuatro filas que contienen esos valores, ya que representan un porcetaje insignificante del total de filas, y ninguno tiene valor 1 en la variable objetivo. Esto nos permite seguir considerando la variable como boolena, y ahorrarnos una nueva categoría que sólo se presentaría 4 veces.
En las variables categóricas generales, se observan varias columnas con cantidades considerables de valores nulos:
pd_data_train[pd_data_cat].isna().sum()
FONDKAPREMONT_MODE 168048 WALLSMATERIAL_MODE 124874 HOUSETYPE_MODE 123195 OCCUPATION_TYPE 77124 ORGANIZATION_TYPE 44460 NAME_TYPE_SUITE 1050 REGION_RATING_CLIENT_W_CITY 0 WEEKDAY_APPR_PROCESS_START 0 NAME_INCOME_TYPE 0 REGION_RATING_CLIENT 0 NAME_HOUSING_TYPE 0 NAME_FAMILY_STATUS 0 NAME_EDUCATION_TYPE 0 NAME_CONTRACT_TYPE 0 dtype: int64
for col in pd_data_cat:
pd_data_train[col] = pd_data_train[col].astype('category').cat.add_categories('Sin valor').fillna('Sin valor')
En las variables categóricas generales se encuentran variables con cantidades significativas de valores nulos, como: FONDKAPREMONT_MODE (168,050 valores nulos), WALLSMATERIAL_MODE (124,875 valores nulos), HOUSETYPE_MODE (123,196 valores nulos). Otras variables como NAME_TYPE_SUITE presentan menos valores nulos (1,050). Optamos por cambiarlos a una nueva categoría Sin valor, ya que al ser tantos, sustituirlos por la moda podría afectar al funcinamiento del modelo.
Exportación parcial del DataSet dividido en Train y Test¶
Para acabar el notebook 02, procedemos a exportar a dos archivos CSV (test_pd_data_preprocessing_missing_outliers.csv y train_pd_data_preprocessing_missing_outliers.csv) el contenido del dataset ya dividido en Train y Test.
pd_data_train.to_csv("../data/data_train_f.csv")
pd_data_test.to_csv("../data/data_test_f.csv")